Method for extracting, merging and ranking search engine results

ABSTRACT

A method and a computer program product for identifying the domains, selecting for each domain one domain-specific search engine and data source to be involved, generating the domain-specific subqueries for each selected search engine, defining a strategy for sending requests to each search engine and data source, and receiving, merging and ranking results. The result of the multi-domain query is a list of combinations, where every combination consists of a tuple of data, each relative to one of the domains of the query; such data is present in the results returned either by search engines or by data sources. The method provides the combinations having the highest combination score, as computed by a monotone aggregation function over the combinations.

FIELD OF THE INVENTION

The present invention generally relates to answering multi-domain queries that require invoking several domain-specific search engines. In particular, the present invention relates to a method to extract, merge and rank results of subqueries issued to one search engine and one data source for each said domain. It details an algorithm that computes an optimal strategy that minimizes the expected total cost associated with the invocations to said search engines and data sources needed to calculate the best K answers to the multi-domain query posed by the user.

BACKGROUND OF THE INVENTION

The World Wide Web (WWW) is a connected collection of computers offering users worldwide the possibility to extract information in response to queries. The process of information extraction includes accessing information which is stored, in the format of Web pages, in a class of computer systems called Data Sources; in order to locate such Web pages, users rely on another class of computer systems, called Search Engines, whose specific ability is to extract those Web pages which most probably contain information that is relevant to the query. Given the huge number of Web pages and the breadth of information available on the World Wide Web, the spread and success of Search Engines have grown steadily in the last decade.

A typical search engine offers an interface where users enter a query expression consisting of keywords, sometimes interconnected by logical operators. The search engine uses pre-calculated index structures in order to produce the results that best match the query expression, and presents its results in the form of a ranked list of elements. Every element is usually further characterized by a hyperlink pointing to a Web Page and by additional descriptions of the content of the Web Page. Elements are ranked by the search engine with an element score, and presented on the computer interface in ranking order; the computer interface comprises a given number of elements, and navigational commands on the interface allow the user to extract more elements, or to change the query, or to follow one of the links associated with one of the elements.

An exemplary search engine is the Google® search engine. This system uses a ranking technique, called the PageRank algorithm, which gives each Web Page a score that measures the relevance of that Web Page by means of a metric related to the relevance of the other Web pages containing hyperlinks pointing to it. Reference is made to L. Page, et al., “The PageRank citation ranking: Bringing order to the web”, technical report, Stanford Digital Library Technologies Project, 1998. Paper SIDL-WP-1999-1020. The PageRank algorithm has proven to be very effective for general-purpose search, i.e. for queries which have no associated, predefined domain of interest.

In addition to general-purpose search engines, a number of search engines exist which are specifically dedicated to given domains. Examples of broad domains are: travels, books, cars; examples of narrower domains are research centers within a given field or country, hospitals in a given city. Examples of domain-specific search engines for travels are: Expedia®, EasyFly® and TravelAdvisor® search engines. Domain-specific search engines outperform general purpose search engines for domain-specific queries, because they use specific knowledge about their domains of interest. In the case of travels, they can use information about travelling time, fares, connections, and so on in order to inform the user about the “best” travel combination, where “best” is further characterized according to the indications of the user, who may be interested in aspects such as total cost, total travel duration, desired departure and arrival times, and so on.

The background material for the method is classified into three categories, which are analyzed next. The first one is concerned with extracting the best documents from a document collection where several rankings are possible; the second one concerns merging search results where each result is ranked; and the third one concerns multi-domain queries.

A vast amount of work has been performed for addressing the issue of extracting the best documents from a document collection upon which several rankings are available. Examples include Fagin's Algorithm (FA), as described in Ronald Fagin, “Combining Fuzzy Information from Multiple Systems”, Journal of Computer and System Sciences, 1999, Volume 58(1): 83-99, the Threshold Algorithm (TA) as described in Ronald Fagin, et al., “Optimal Aggregation Algorithms for Middleware”, IBM Research Report RJ 10205, 2000, pp. 1-40, the Quick-Combine Algorithm (QA) as described in Ulrich Guntzer et al., “Optimizing Multi-Feature Queries for Image Databases”, Proceedings of the Very Large Data Bases (VLDB) Conference, Cairo, Egypt, August 2000, pp. 419-428, and the HRJN algorithm as described in Ihab F. Ilyas et al., “Supporting top-k join queries in relational databases”, VLDB Journal, 2004, Volume 13(3): 207-221.

FA considers a collection of elements (such as textual documents) and assumes that several distinct rankings can be used for extracting the elements from the collection. Accordingly, elements can either be accessed by a sequential access, according to one of the various rankings, or by a random access, using specific information of each element, which is different in every element and therefore constitutes an element identifier; the FA algorithm assumes a computer system that can support both sequential accesses and random accesses. The aforementioned reference illustrates the FA algorithm in the case where the sequential access costs are identical and the random access costs are also all identical. Each element is associated with an overall element score, defined as a monotone aggregation function of the scores of the element in the available rankings. The purpose of the FA algorithm is then to extract the “top K” elements of the collection, i.e. the K elements with maximal overall element score, while minimizing the cost of extraction, instead of reading all elements from all the ranked lists. FA starts accessing elements by making sequential accesses and stops when K common elements have been found, and then performs additional random accesses in order to guarantee that the set of elements that are accessed, either by sequential or random accesses, includes the “top K” elements, which can therefore be presented as the output of the FA algorithm.
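The two phases of FA described above can be illustrated with a minimal Python sketch. It assumes that every ranking lists the same elements, simulates random access with in-memory dictionaries, and ignores access costs; the function name `fagin_top_k` and the data layout are hypothetical and are not taken from the cited reference.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Ranking = List[Tuple[str, float]]  # (element id, score), best score first


def fagin_top_k(rankings: Sequence[Ranking], k: int,
                aggregate: Callable[[Sequence[float]], float]) -> List[Tuple[str, float]]:
    """FA sketch; assumes every ranking lists the same elements."""
    # Random access is simulated with one dictionary per ranking.
    lookup: List[Dict[str, float]] = [dict(r) for r in rankings]
    seen_in: Dict[str, Set[int]] = {}
    # Phase 1: parallel sequential accesses until K elements have been
    # seen in every ranking ("K common elements").
    for depth in range(max(len(r) for r in rankings)):
        for i, ranking in enumerate(rankings):
            if depth < len(ranking):
                elem_id, _ = ranking[depth]
                seen_in.setdefault(elem_id, set()).add(i)
        common = sum(1 for s in seen_in.values() if len(s) == len(rankings))
        if common >= k:
            break
    # Phase 2: random accesses complete the scores of every seen element,
    # then the monotone aggregation function ranks them.
    scored = [(e, aggregate([lk[e] for lk in lookup])) for e in seen_in]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

With two rankings over four elements and `min` as the aggregation function, the sketch stops after two sequential accesses per ranking when K is 1, without reading the remaining entries.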

In TA, sequential accesses are made to each ranked list to retrieve elements and their element scores, and, for each retrieved element, a random access is made to retrieve the element score of that element on the other ranked lists so as to determine the element's overall score, which is computed via a given monotone aggregation function combining the individual element scores of the element in the available rankings. Element retrieval is stopped as soon as there are K elements with an overall score that is not below a threshold computed via the aggregation function over the element scores of the last seen elements in each ranked list. QA uses a similar idea to TA but attempts to improve the global cost by reading more elements from the less expensive rankings.
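A minimal Python sketch of TA follows, under the same simplifying assumptions as before: every ranking lists the same elements, random access is simulated with dictionaries, and access costs are ignored. The stopping rule used is the standard one, in which retrieval ends once K elements have an overall score at least as high as the threshold. The name `threshold_top_k` is hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Ranking = List[Tuple[str, float]]  # (element id, score), best score first


def threshold_top_k(rankings: Sequence[Ranking], k: int,
                    aggregate: Callable[[Sequence[float]], float]) -> List[Tuple[str, float]]:
    """TA sketch; assumes every ranking lists the same elements."""
    lookup: List[Dict[str, float]] = [dict(r) for r in rankings]  # simulated random access
    overall: Dict[str, float] = {}
    for depth in range(len(rankings[0])):
        last_seen = []
        for ranking in rankings:
            elem_id, score = ranking[depth]  # one sequential access
            last_seen.append(score)
            if elem_id not in overall:
                # Random accesses fetch the element's score on every list.
                overall[elem_id] = aggregate([lk[elem_id] for lk in lookup])
        # Threshold: aggregation of the last scores seen on each list.
        threshold = aggregate(last_seen)
        best = heapq.nlargest(k, overall.items(), key=lambda pair: pair[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, overall.items(), key=lambda pair: pair[1])
```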

HRJN is an operator that addresses the rank-join problem, i.e., the problem of computing joins in top-k queries. It is an extension of TA to the rank-join problem, in which the goal is to compute the top K combinations of elements that match on a given subset of their properties (join attributes).

The need for modular scoring systems for merging search results in the context of document bases, of Intranets, and of the World Wide Web is addressed by U.S. Pat. No. 7,257,577 B2, August 2007. The modular scoring system merges search results into a ranked list of results using many different features of documents. The block diagram of the high-level architecture of the modular scoring system, illustrated in FIG. 2 of that patent, includes scoring modules based upon the indexing of textual properties of the documents (such as content, title, and anchor text) as well as processors which use generic document properties, such as their page rank; indegree; discovery date; URL words, depth, and length; and geography. For example, in one of the proposed approaches, a rank aggregation processor uses a graph method that uses as input, for every document, its position in the ranking; the algorithm operates upon collections of edges from documents to positions, where every edge <D, P> defines the cost of ranking the document D in position P. The method uses a minimum-cost perfect matching to assign a unique score to each document, thereby building a global ranking.

The problem of combining multiple ranked lists into a single ranked list is considered in the following references. In the Patent Application US 2006/0190425, August 2006, a framework for incrementally joining ranked lists, while minimizing memory constraints and disk or memory swapping costs, is presented. The framework focuses on a specific aspect of the architecture of a computer system and does not take into account the articulation of the methods and systems available on the Web. In the U.S. Pat. No. 6,728,704 B2, April 2004, a method and apparatus for merging result lists from multiple search engines is presented. The method operates on sub lists which are produced by a given query independently performed upon many search engines, and merges such sub lists into a single list by first computing the average score of the list elements, then extracting those elements from the list with highest score and simultaneously reducing the length of the list by one. The result is therefore a merged list of the sub lists extracted from every Search Engine, without modifications to individual entries of the list.

While general-purpose search engines and domain-specific search engines address the needs of many users for locating pages in the World Wide Web, they do not perform well when a user presents a multi-domain query, i.e., a query which addresses multiple domains at the same time. Such queries require information extracted from search engines and data sources relative to two or more domains, such as travels and musical events, or care centers and doctor specializations and insurance coverage.

For these queries, which are classified as multi-domain queries, specific query management methods are designed. Reference is made to Daniele Braga, et al., “Optimization of Multi-Domain Queries on the Web”, Proceedings of the Very Large Data Bases (VLDB) Conference, Auckland, New Zealand, August 2008, pp. 562-573, where the notion of Multi-Domain Query is first introduced, and a model for the management of such queries is presented. A multi-domain query is received through a user interface and presented to Search Engines, which extract ranked lists of elements. The model presents a collection of operations for manipulating such ranked lists; the operations collectively constitute a computer program that produces an answer to the multi-domain query. The aforementioned reference also presents a collection of approximate (heuristic) methods for selecting a Query Plan for a given multi-domain query, where a query plan is a well-defined chain of requests upon selected Search Engines for answering the query. The query plan selection is based upon the association of each operation with costs of execution. An important operation is the join of the ranked lists produced by two Search Engines; reference is made to Daniele Braga, et al., “Joining the Results of Heterogeneous Search Engines”, Information Systems, Vol. 33, Issues 7-8, November-December 2008, pp. 658-680, where several sub-methods for performing such join are described; such sub-methods are used by the aforementioned model. The model is effective for giving a first approximation of the solution of the multi-domain query answering problem, but it does not provide an optimal solution, i.e., one which minimizes the cost of access to Search Engines.

The present invention describes a new algorithm that extends the FA algorithm to the context of joins between search engines, thus also providing a solution to the rank-join problem. The characteristics of FA make it possible for the present invention to determine, at query formulation time, the optimal execution strategy for a query, even when information on the distributions of the scores of the elements returned by the search engines is not available. This is particularly relevant for the context of search engines over the Internet dealt with by the present invention, where such distributions are generally unknown or, if known, would typically be subject to change. Another important aspect of the present invention is that the optimal execution strategy it determines is independent of the aggregation function that is used to combine the element scores into a global combination score. This indicates that no extra access to the search engines is required upon modifications of the aggregation function, such as changes of the weights in a weighted sum. Note instead that determining an optimal execution strategy with rank-join algorithms based on TA, including HRJN, necessarily requires knowledge or assumptions on both the score distributions and the aggregation function.

SUMMARY OF THE INVENTION

The present invention deals with answering multi-domain queries; it presents a method and a computer program product for identifying the domains, selecting for each domain one domain-specific search engine and data source to be involved, generating the domain-specific subqueries for each selected search engine, defining a strategy for sending requests to each search engine and data source, and receiving, merging and ranking results, while minimizing the cumulative cost of requests to the search engines and data sources. The result of the multi-domain query is a list of combinations, where every combination consists of a tuple of data, each relative to one of the domains of the query; such data is present in the results returned either by search engines or by data sources. The proposed method provides the “top K” combinations, i.e. the combinations having the highest combination score, as computed by a monotone aggregation function over said combinations. Therefore, the method is essential in order to provide effective management of multi-domain queries by a vast number of users, currently using single-domain or generic search engines.

BRIEF DESCRIPTION OF DRAWINGS

The various features of the present invention will be progressively described in greater detail with reference to the following description, claims, and drawings, wherein reference numerals are reused, where appropriate, to indicate the correspondence between the referenced items, and wherein:

FIG. 1 is a schematic illustration of an exemplary operating environment for computing the best K results of a multi-domain query in which the present invention can be used;

FIG. 2 is a flow chart illustrating the process for computing the best K results of a multi-domain query in the context of the exemplary operating environment of FIG. 1 according to the present invention;

FIG. 3 is a block diagram of the high-level software architecture of the present invention;

FIG. 4 is a flow chart illustrating in more detail the process for computing the best K results of a multi-domain query; and

FIG. 5 is a flow chart illustrating in maximum detail the process for computing the best K results of a multi-domain query of FIG. 4 in the context of the high-level software architecture of FIG. 3.

DETAILED DESCRIPTION

The following definitions provide background information pertaining to the technical field of the present invention, and are intended to facilitate the understanding of the present invention without limiting its scope:

Query: a set of keywords or phrases, possibly separated by logical connectives, submitted by the user. A query submitted to the system causes the computation of k combinations, where k is a parameter submitted by the user together with the query.

Subquery: a subset of the query's keywords or phrases, possibly separated by logical connectives, which is sent to a specific search engine.

Search engine: A system that returns a ranked list of elements in response to a subquery. The search engine uses internal methods, information about Web resources and possibly other factors, such as user's preferences, language, geographical location etc., in order to return elements ranked according to a measure of their relevance, called element score.

Rank: The position of an element in the ranked list returned by a search engine.

Element: An item returned by a search engine or data source, which can take various forms depending on the nature of the search engine or data source. Elements may include attribute-value pairs, links to web pages and, in some cases, documents and image files, audio and video streams, etc. Each element describes a given domain of the query.

Element score: The numerical value associated, either implicitly or explicitly, by a search engine or data source to every element in response to a query. It measures the relevance of the element.

Data source: A system that returns a set of elements in response to a request. The request consists of one or more attribute-value pairs. The elements in the response comprise attribute-value pairs which are equal to the attribute-value pairs in the request.

Request: an interaction with a search engine or with a data source. The interaction with a search engine for answering a given subquery requires several requests, where the response to every request is a chunk of elements, as search engines may produce very long lists of results; however, a limited number of requests is normally sufficient in order to extract, in their responses, the elements which contribute to the top-k combinations. The interaction with a data source requires only one request, consisting of one or more attribute-value pairs. The elements in the response comprise attribute-value pairs which are equal to the attribute-value pairs in the request.

Chunk: Sub-list of consecutive elements from a ranked list of elements. Typically, results from a request sent to a search engine are returned in subsequent chunks of a fixed maximum size.

Sequential access: A request sent to a search engine, whose response consists of a chunk of elements. The first sequential access retrieves a chunk containing the first elements of the ranked list produced by the search engine. Each subsequent sequential access retrieves the next chunk of elements in the ranked list.

Attribute-based access: A request sent to a data source that transmits a value for some given attribute handled by the data source. The response to the request consists of the set of elements that match the supplied value in the given attribute.

Combination: n-tuple of elements, such that each element is associated with one domain of the query and is present in the results produced either by search engines or by data sources. The elements in a combination satisfy equality predicates on specific element attributes, called join attributes.

Combination score: The numerical value attributed to the combination, as the global aggregation of the element scores attributed by each search engine to each element. For each query, such global aggregation is a given, monotone aggregation function.

Top-k combinations: The combinations associated with the k highest combination scores. Top-k combinations are the result of the query.
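As an illustration of the last two definitions, the following minimal Python sketch scores combinations with a weighted sum, which is one possible monotone aggregation function, and keeps the k best. The function names, the weights and the tuple-based representation of a combination are assumptions made for this example only.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: a combination is a tuple of
# (element name, element score) pairs, one pair per domain of the query.
Combination = Tuple[Tuple[str, float], ...]


def weighted_sum(weights: Sequence[float]) -> Callable[[Sequence[float]], float]:
    """Build a monotone aggregation function from non-negative weights."""
    if any(w < 0 for w in weights):
        raise ValueError("negative weights would break monotonicity")
    return lambda scores: sum(w * s for w, s in zip(weights, scores))


def top_k(combinations: Sequence[Combination], k: int,
          aggregate: Callable[[Sequence[float]], float]) -> List[Tuple[float, Combination]]:
    """Return the k combinations with the highest combination score."""
    scored = [(aggregate([s for _, s in combo]), combo) for combo in combinations]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```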

The search engine and data sources hosting a given element, mentioned in the above definitions, may be part of the same computing system, e.g., a software service offering different interfaces supporting sequential and attribute-based access, or instead they may be located upon different computing systems. The distinction between search engines and data sources, which is on purpose made very sharp in the above definitions, is less clear in many Web contexts, where the same computing system may act as search engine or as data source depending on the particular invocation. Therefore, the invention also covers the case of systems which are generically classified as data sources due to their main condition of use but which may act at times as search engines, because they provide sequential access; if the multi-domain query uses sequential access, such systems are equivalent to search engines for the purpose of this invention. It likewise covers the case of systems which are generically classified as search engines due to their main condition of use, but which may act at times as data sources, because they provide attribute-based access; if the multi-domain query uses attribute-based access, such systems are equivalent to data sources for the purpose of this invention.

It is to be noted also that the answer to a multi-domain query is a list of combinations, and every combination is an n-tuple of n elements, where for every index i ≤ n, the i-th element of the combination is extracted from exactly one domain-specific search engine or data source. Every combination is thus the result of n−1 join operations between elements; for any pair of elements which are joined, the attributes which are used in the join are denoted as “join attributes”. The identification of the joins which are required to build a combination is an immediate consequence of the identification of the domains which are required by the query and of the search engines and data sources which are used for each above-mentioned domain.
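For n = 2, the single join that builds combinations can be sketched as a hash join on one join attribute. This is only an illustrative Python sketch: the dictionary-based element layout, the function name `join_elements` and the attribute name used in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Element = Dict[str, object]  # hypothetical layout: attribute name -> value


def join_elements(left: Sequence[Element], right: Sequence[Element],
                  join_attr: str) -> List[Tuple[Element, Element]]:
    """Pair every left element with the right elements that carry the
    same value for the join attribute (equality predicate)."""
    index = defaultdict(list)
    for element in right:
        index[element[join_attr]].append(element)
    return [(l, r) for l in left for r in index.get(l[join_attr], [])]
```

Calling `join_elements(hotels, restaurants, "street")` on two lists of such dictionaries would return the hotel and restaurant pairs located in the same street; a query over n domains would chain n−1 such joins.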

FIG. 1 portrays an exemplary operating environment 5 in which a user from within a client system 10 submits a query 15 and the number K of results that must be produced. The query 15 and the number K are submitted through a browser 20 to a server system 25.

The system 25 comprises a multi-domain search engine integrator system 30. The multi-domain search engine integrator system 30 is able to return a result 34, which comprises the top-K combinations. The result 34 of the query 15 is displayed by the browser 20. To compute the result 34, the system 30 invokes a plurality of search engines 35-39 and data sources 45-48 available through an internet access 40.

FIG. 2 illustrates a process flow 100 in which the user 10 submits the query 15 and the number K of results that must be produced.

The multi-domain search engine integrator system 30 comprises: a query analysis and search engine selection component 205, a query processor component 215, a score computation component 220 and a storage management component 210. The system 25 also comprises a cost model 150 and a result presenter 160.

The query 15 is processed by the query analysis and search engine selection component 205. Said component 205 decomposes the query into subqueries 120, 121, and 122. Each subquery is associated with a different domain and each domain is associated with a single search engine 35-39 and with a single data source 45-48, based on parameters available from the storage management component 210.

Said subqueries 120-122 are passed on to the query processor component 215, which uses Internet Access 40 in order to produce requests to the search engines 35-39 and data sources 45-48. The requests are sent according to an optimal strategy that minimizes the expected total cost associated with the requests sent to search engines 35-39 and data sources 45-48, based on a cost model 150 establishing how to compute said expected total cost.

Results in the form of lists or sets of elements are then received from said search engines 35-39 and data sources 45-48. Said lists or sets of elements are processed by the combination and score computation component 220, which computes the combinations that can be formed with the lists or sets of elements and the corresponding combination scores. Component 220 computes a plurality of combinations so as to include the top-K combinations. It then associates a combination score with each combination and ranks each obtained combination according to its combination score. Further, it extracts the K combinations with the highest combination score, so as to minimize the global cost of the query, which is a function of the costs of the requests addressed to search engines and data sources. The K combinations with the highest combination score are passed to the result presenter 160, which shows said combinations, forming a result 34, to the user.

The case where the data available to the system are not sufficient to form K combinations is not considered here. This case is considered in the more detailed description given for FIG. 5.

FIG. 3 illustrates a block diagram of the high-level architecture of the multi-domain search engine integrator 30. The query analysis and search engine selection component 205 is responsible for parsing the input query 15, in order to determine the domains involved and, for each domain, to identify the search engine and data source that need to be invoked. The storage management component 210 comprises a storage medium such as a hard drive or like devices where auxiliary data needed for the functioning of system 30 are stored, including descriptions of domains, search engines, and interfaces. The storage management component 210 is also used to store the results produced by the query processor component 215, which comprises two modules. An interfaces component 225 comprises modules for sequential access 230 and attribute-based access 235, which perform requests to search engines and data sources respectively. A combination and score computation component 220 builds combinations of elements obtained by the invoked search engines and data sources; each combination also comprises the corresponding element scores as well as the combination score, which is computed via a monotonic aggregation function of the element scores.

FIG. 4 illustrates a process flow 300 to compute, in the operating environment 5, the best K results of a multi-domain query that requires integrating the answers of multiple search engines and data sources.

The user query 15, consisting of search terms, and a desired number K of results are provided by a user (step 305). The user query is decomposed into different subqueries 120-122 (step 310), each consisting of terms pertaining to some specific domain as determined based on domain descriptions available from the storage management component 210. Based on the subqueries and on search engine descriptions at the storage management component 210, the relevant search engines and data sources to be invoked during query execution are selected from the available search engines 35-39 and data sources 45-48 (step 315).

The search engines selected in step 315 are then used within an iterative method which has the objective of determining the top-K combinations satisfying the user's query. At each iteration, one search engine is chosen (step 320) for subsequent invocation (step 325); the choice of the search engine (step 320) is performed by using an optimal strategy that maximizes the number of combinations that can be formed by using the new elements retrieved, thereby minimizing the expected total cost associated with the invocations of search engines and data sources. The effect of the search engine invocation is to return ranked lists of elements. From said lists, new combinations are formed and the associated combination scores computed (step 330).

The number of formed combinations is compared to K (step 335). The process continues with step 320 in case the number of formed combinations is less than K, since not enough candidate combinations for the top K combinations have been found. If instead the number of formed combinations is greater than or equal to K, the iterative part of the method is complete. As for FIG. 2, the case where the data available to the system are not sufficient to form K combinations is not considered here. This case is considered in the more detailed description given for FIG. 5.

The method then uses the set of candidate combinations extracted so far as the starting point for generating attribute-based access requests, so as to compute more combinations and associated combination scores in a way that guarantees that the so extended set of candidate combinations includes the top K combinations (step 340). Said requests are subsequently executed (step 345). From the set of all combinations formed so far, the K combinations with highest combination scores are selected (step 350) and then presented as a query result 34 to the user (step 355).
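Steps 320 to 350 can be sketched for two services as follows. This is a simplified Python illustration and not the claimed method: a plain round-robin choice stands in for the optimal strategy of step 320, both services are simulated by in-memory lists sorted by descending score, and all names (`top_k_pairs`, the `name` and `score` keys) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Element = Dict[str, object]  # needs "name", "score" and the join attribute


def top_k_pairs(hotels: List[Element], restos: List[Element],
                chunk_sizes: Tuple[int, int], join_attr: str, k: int,
                aggregate: Callable[[float, float], float]):
    lists = [hotels, restos]   # each sorted by descending score
    seen = [[], []]            # elements retrieved so far, per service
    cursor = [0, 0]

    def combos(left, right):
        return {(l["name"], r["name"]): aggregate(l["score"], r["score"])
                for l in left for r in right if l[join_attr] == r[join_attr]}

    def by_attr(service, value):  # simulated attribute-based access
        return [e for e in service if e[join_attr] == value]

    # Steps 320-335: sequential accesses until K combinations are formed.
    turn = 0
    while (len(combos(seen[0], seen[1])) < k
           and any(cursor[i] < len(lists[i]) for i in (0, 1))):
        if cursor[turn] >= len(lists[turn]):
            turn = 1 - turn  # this service is exhausted, use the other
        chunk = lists[turn][cursor[turn]:cursor[turn] + chunk_sizes[turn]]
        seen[turn].extend(chunk)
        cursor[turn] += len(chunk)
        turn = 1 - turn
    # Steps 340-345: attribute-based accesses complete the candidates.
    found = combos(seen[0], seen[1])
    for h in seen[0]:
        found.update(combos([h], by_attr(restos, h[join_attr])))
    for r in seen[1]:
        found.update(combos(by_attr(hotels, r[join_attr]), [r]))
    # Step 350: keep the K combinations with the highest score.
    ranked = sorted(found.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

The second phase matters because a seen element may join with an element that no sequential access has retrieved yet; with a monotone aggregation function, combinations made only of unseen elements cannot outrank the K combinations already formed.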

FIG. 5 illustrates the execution process of the proposed invention and details how the architectural components defined in FIG. 3 participate in the process illustrated in FIG. 4.

For the sake of clarity, in a simple running example a user submits to the system a query requesting the best two combinations of hotels and restaurants in Paris, located in the same street. While in the example the query is answered by few combinations for ease of reading, in reality the query produces many combinations.

The query 15 together with the number of desired combinations K is fed to the query analysis and search engine selection component 205. In one embodiment, the query comprises the specification of an aggregation function for aggregating the element scores in order to form a combination score. In another embodiment, a suitable aggregation function is adopted after the selection of search engines, which is next described.

In the running example, it is assumed K=2, and the scores may be represented for the hotels by the official international star rating for hotels (from no star to five stars), and for the restaurants the scores may be expressed as the average evaluation (on a scale from 0 to 4) given by a community of customers reporting their assessment to one of the search engines used by the system. As aggregation function the average of the two scores is considered, normalized over a [0,1] interval.
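One plausible reading of this aggregation function, given here only as a hedged Python sketch, first normalizes each score to [0,1] by dividing by the top of its scale and then averages the two values. The order of normalization and averaging is an assumption, as is the function name `combination_score`.

```python
def combination_score(hotel_stars: float, restaurant_rating: float) -> float:
    """Average of the two element scores after normalizing each to [0, 1].

    Assumes stars on a 0-5 scale and customer ratings on a 0-4 scale,
    as in the running example.
    """
    if not 0 <= hotel_stars <= 5:
        raise ValueError("hotel_stars must lie in [0, 5]")
    if not 0 <= restaurant_rating <= 4:
        raise ValueError("restaurant_rating must lie in [0, 4]")
    return (hotel_stars / 5 + restaurant_rating / 4) / 2
```

Under this reading, a four-star hotel combined with a restaurant rated 3 out of 4 scores (0.8 + 0.75) / 2 = 0.775, and the function is monotone because raising either element score can never lower the combination score.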

First, the query is decomposed into subqueries 120-122 by the query decomposition and domain selection module 404, in step 310. In one embodiment, the mapping from query terms to subqueries and to domains is predetermined, as the user is asked to enter query terms into predefined forms and each form is mapped to a predetermined subquery and each form field to a predetermined domain. In another embodiment, the mapping between query terms and domains is performed automatically by means of a matching algorithm between query terms and domain descriptions 406 available in the storage management component 210, and such matching algorithm also builds the subquery associated with each domain; the matching algorithm is outside the scope of the present description. In both cases, the query decomposition and domain selection module 404 produces a set of subqueries 120-122, each associated with a specific domain. For each subquery, one relevant search engine is selected (step 315) for invocation, by using search engine descriptions 412 available in the storage management component 210, thereby associating each subquery with exactly one search engine, and also associating one corresponding data source that provides attribute-based access to the elements of the search engine. In one embodiment, the relevant search engines are selected manually by the user that submits the query. In another embodiment, the search engines are selected automatically, by matching the domain of the subquery with the domain covered by the search engines, as defined in the search engine descriptions.

In the running example, the two subqueries correspond to a search for good restaurants and a search for good hotels, associated with the domains of gastronomy and tourist accommodation. For the sake of the example, two services, namely Hotels and Restos, are assumed to be available on the Web; both allow sequential as well as attribute-based access. Thus, both services behave as search engines when invoked sequentially, and as data sources when invoked based on attribute values.

The execution of the query is governed by the query processor component 215, which invokes the search engines and integrates the results in the form of element combinations. The result is obtained iteratively; at each iteration, one or more search engines are chosen (step 320), the selected search engines are invoked and each one returns a chunk of elements. The specific order of invocation is determined by the query processor component 215. Said specific order of invocation is optimally chosen; optimality consists in minimizing the expected total cost of sending requests to search engines and data sources, which depends on the number and cost of each invocation. Therefore, the optimization is based upon estimates of the costs of invocation of the search engines and data sources, which depend on the sizes of their chunks, the estimated number of occurrences of the elements indexed by the search engines, and the cardinalities of the sets of elements satisfying attribute-based accesses, which are all available in the interface description parameters 418.

In the running example, both search engines are chunked, and sequential accesses to Hotels are more costly than sequential accesses to Restos, while for attribute-based accesses the opposite holds. The cardinalities of the two services are exemplified taking true values from actual services at the time of writing: 858 indexed hotels in Paris, located in 592 different streets, and 1198 restaurants, in 977 distinct streets, 188 of which are in common with those of the hotels. For the sake of exemplification, Hotels returns chunks of size 2, while Restos returns chunks of size 3.

In order to determine the optimal order of invocation at step 320, as described in the technical report “Cost Aware top-k join algorithms”, by S. Ceri, D. Martinenghi, and M. Tagliasacchi [tech. rep. no. 2009.16 at Politecnico di Milano, Dipartimento di Elettronica e Informazione, July 2009], the execution access strategy can be formulated as a constrained optimization problem with integer variables, whereby the method determines the number of accesses to the search engines, so as to maximize the number of combinations that can be formed, subject to a target cost constraint. In particular, the number of combinations can be approximated by the product of the numbers of sequential accesses to the various search engines, multiplied by a constant factor that depends on the aforementioned parameters and is therefore irrelevant to the maximization problem at hand. The cost function is expressed, instead, by summing the costs incurred by the sequential accesses and those incurred by the attribute-based accesses. The former can be obtained by multiplying the number of sequential accesses made on each search engine by its sequential access cost. The latter can be obtained by multiplying the number of distinct join attribute values retrieved from each search engine by the sum of the attribute-based access costs of all the other data sources that participate in the join. Said cost function is constrained to be within a given cost threshold, indicating the amount of available resources. An approximate solution can be found by relaxing the integer constraint and solving the Karush-Kuhn-Tucker equations for the relaxed problem. Instead of solving said maximization problem under said cost constraint for a fixed value of the threshold, the threshold is varied over a positive interval, so as to obtain a locus of points that represents the access strategy that maximizes the number of retrieved combinations for each step of the algorithm. In particular, said locus, or trajectory, also determines which search engines to access at each step of the algorithm.
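The cost model described above can be illustrated with a small sketch. It is not the method of the cited report: instead of relaxing the integer constraint and solving the Karush-Kuhn-Tucker equations, it simply enumerates small integer access counts. The function names, the unit costs, and the estimate of distinct join values per access (chunk size times the ratio of distinct streets to indexed elements) are assumptions made for illustration only.

```python
from itertools import product as cartesian
from math import prod


def total_cost(n, seq_cost, attr_cost, chunk, distinct_ratio):
    """Sequential cost plus attribute-based cost for access counts n."""
    engines = range(len(n))
    seq = sum(n[i] * seq_cost[i] for i in engines)
    attr = 0.0
    for i in engines:
        # Assumed estimate of distinct join values retrieved from engine i.
        distinct_values = n[i] * chunk[i] * distinct_ratio[i]
        # Each value triggers one access to every other data source.
        others = sum(attr_cost[j] for j in engines if j != i)
        attr += distinct_values * others
    return seq + attr


def best_strategy(budget, seq_cost, attr_cost, chunk, distinct_ratio, max_n=20):
    """Exhaustively pick access counts maximizing prod(n) within the budget."""
    best = None
    for n in cartesian(range(1, max_n + 1), repeat=len(seq_cost)):
        cost = total_cost(n, seq_cost, attr_cost, chunk, distinct_ratio)
        if cost > budget:
            continue
        key = (prod(n), -cost)  # more combinations first, then lower cost
        if best is None or key > best[0]:
            best = (key, n)
    return None if best is None else best[1]


# Index 0 = Hotels, index 1 = Restos; unit costs are hypothetical.
SEQ_COST = [4.0, 1.0]    # sequential access: Hotels costlier than Restos
ATTR_COST = [1.0, 3.0]   # attribute-based access: Restos costlier than Hotels
CHUNK = [2, 3]
DISTINCT_RATIO = [592 / 858, 977 / 1198]
```

Calling `best_strategy` for increasing budgets yields one point of the trajectory per budget value; with the hypothetical costs above, the selected points favor sequential accesses to Restos over Hotels.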

In the running example, the optimal strategy is a sequence of accesses in which more sequential requests are made to Restos than to Hotels, because it is cheaper to make sequential accesses to Restos and it is also cheaper to make the corresponding attribute-based accesses to Hotels.

In step 320, the query processor component 215 invokes the search engines, chosen by the optimization method from the available search engines 35-39, by presenting the associated subqueries as request 420 and receiving new chunks of ranked elements 422 in response; the chunk size depends on the chosen search engine. For every search engine and associated subquery, the first chunk contains the top elements in the search engine's ranking; the subsequent chunks contain new elements which immediately follow, in the search engine's ranking, the elements which were already retrieved for that subquery and stored within an appropriate repository 426. Module 424 adds the new chunks to the repository 426.

At the end of each iteration, if the received response 422 contains some elements, then module 424 passes the control to step 330, which computes combinations and combination scores by joining the elements and by computing the aggregation function associated with the query, and stores each combination with its score in an appropriate repository 434 in the Storage Management Component 210. If the number of combinations found is smaller than K (step 335), the process is re-iterated, by choosing the search engines to be invoked, at step 320.

If, at the end of an iteration, the received response 422 contains no elements, then the invoked search engine is marked as emptied (step 454), because all of its elements have been retrieved.

If a search engine is marked as emptied right after the first invocation, said marking is considered critical, since this means that no combination can be formed. In this scenario, therefore, the query has failed: an alert is issued by module 456, and K is set to zero.

If all search engines are marked as emptied, said marking is also considered critical, since this means that no more combinations can be formed, because all possible elements resulting from requests to all search engines have been retrieved. In this case, it is not possible to obtain K combinations, because there exist only H combinations, where H is the number of combinations obtained so far, with H smaller than K. Therefore, module 456 alerts the user about this circumstance and sets K=H. If H is equal to zero, the query fails. Otherwise, if some search engines are marked as emptied but some other search engines are not, then such a situation is not considered critical, since by sending requests to the search engines which are not emptied more elements can be received and more combinations can be formed. In such a case, module 454 passes the control back to step 320, but no request will then be sent to those search engines which are marked as emptied.
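The iteration over steps 320-335, including the handling of emptied search engines, can be sketched as follows. The sketch is illustrative only: in-memory lists stand in for remote search engines, a fixed schedule stands in for the optimized choice of step 320, the join is an equality on the street, and all names are hypothetical.

```python
def sequential_phase(engines, chunk_sizes, choose, join, k):
    """Fetch chunks until at least k combinations exist or all engines are emptied.

    engines: name -> full ranked list (stands in for a remote search engine)
    choose:  callable(active_names, step) -> name of the engine to invoke
    join:    callable(retrieved) -> list of combinations
    Returns (retrieved elements per engine, combinations, effective K).
    """
    retrieved = {name: [] for name in engines}
    emptied = set()
    step = 0
    while True:
        active = [name for name in engines if name not in emptied]
        if not active:  # all engines emptied: only H < k combinations exist
            combos = join(retrieved)
            return retrieved, combos, len(combos)
        name = choose(active, step)
        step += 1
        start = len(retrieved[name])
        chunk = engines[name][start:start + chunk_sizes[name]]
        if not chunk:
            emptied.add(name)
            if start == 0:  # emptied at its first invocation: the query fails
                return retrieved, [], 0
            continue
        retrieved[name].extend(chunk)
        combos = join(retrieved)
        if len(combos) >= k:
            return retrieved, combos, k


# Data of the running example: (name, street, score), in ranking order.
ENGINES = {
    "Hotels": [("H1", "S1", 5), ("H2", "S2", 4), ("H3", "S5", 2), ("H4", "S5", 1)],
    "Restos": [("R1", "S3", 3.9), ("R2", "S1", 3.8), ("R3", "S4", 3.7),
               ("R4", "S3", 3.7), ("R5", "S2", 3.6), ("R6", "S5", 3.5),
               ("R7", "S1", 3.4)],
}
CHUNK_SIZES = {"Hotels": 2, "Restos": 3}
SCHEDULE = ["Hotels", "Restos", "Restos"]  # assumed order of invocation


def follow_schedule(active, step):
    """Pick the scheduled engine, skipping engines already marked as emptied."""
    wanted = SCHEDULE[step % len(SCHEDULE)]
    return wanted if wanted in active else active[0]


def join_on_street(retrieved):
    """Pair every retrieved hotel with every retrieved restaurant in its street."""
    return [(h, r) for h in retrieved["Hotels"] for r in retrieved["Restos"]
            if h[1] == r[1]]
```

With K=2, the call `sequential_phase(ENGINES, CHUNK_SIZES, follow_schedule, join_on_street, 2)` stops after one access to Hotels and two accesses to Restos.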

If the number of combinations is at least K, there is still no guarantee that such combinations include the top K combinations; however, there is already a guarantee that such top K combinations can be found by computing additional combinations obtained by joining the elements already retrieved from search engines and stored in the element storage 426 with elements that will be produced as responses to requests to data sources; therefore, no more sequential accesses to search engines are needed, and only attribute-based accesses to data sources are required.

In the running example, K=2 combinations have been found after one sequential access to Hotels and two sequential accesses to Restos, and the repository 426 contains the following results, where different chunks are separated by a double line:

Hotel name   Hotel street   Hotel score
H1           S1             5
H2           S2             4

Rest. name   Rest. street   Rest. score
R1           S3             3.9
R2           S1             3.8
R3           S4             3.7
=======================================
R4           S3             3.7
R5           S2             3.6
R6           S5             3.5

The above results determine the following two combinations, which are stored in the repository 434:

Hotel name   Rest. name   Hotel score   Rest. score   Street   Combined score
H1           R2           5             3.8           S1       0.975
H2           R5           4             3.6           S2       0.85

Although these are not necessarily the top K combinations, there is now the guarantee that each of the top 2 combinations is formed with at least one of the hotels and restaurants retrieved sequentially so far, i.e., H1, H2, R1, R2, R3, R4, R5, and R6.
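One way to see why this guarantee holds is a bound based on the monotonicity of the aggregation function: a combination made only of elements not yet retrieved sequentially cannot score higher than the aggregate of the last scores seen from each search engine. The sketch below checks this bound on the running example; the function name is hypothetical and the normalization by 5 and by 4 is the one assumed for the running example.

```python
def unseen_upper_bound(last_hotel_stars: float, last_resto_rating: float) -> float:
    """Best score reachable by pairing two elements not yet retrieved sequentially."""
    return (last_hotel_stars / 5.0 + last_resto_rating / 4.0) / 2.0


# Scores of the K=2 combinations found so far: (H1, R2) and (H2, R5).
found_scores = [0.975, 0.85]

# Last elements seen sequentially: H2 (4 stars) and R6 (rating 3.5).
bound = unseen_upper_bound(4, 3.5)
```

Here the bound evaluates to 0.8375, which does not exceed the lowest of the two scores already found (0.85), so any combination that could still enter the top 2 must contain at least one element already retrieved.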

Then, at step 438, the set of attribute-based accesses that need to be performed in order to compute all candidate combinations (and associated combination scores) containing the top K combinations is determined. In particular, if E is an element already returned by a search engine and stored in the element storage 426, the data source S can be joined with E, and Vi are the values of the join attributes Ai used in the join between said element E and data source S, then an attribute-based access is directed to the data source S, such that the attribute-based access request consists of attribute-value pairs <Ai, Vi>. Said accesses are then executed via the interface for attribute-based access 235 by presenting a request 440 to the relevant data source and receiving a set of elements 444 in response.

In the running example, attribute-based accesses to Restos are required with values S1 and S2, and to Hotels with values S1, S2, S3, S4, and S5.
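The determination of the required attribute-based accesses can be sketched as follows, assuming a single join attribute (the street) stored at a fixed position in each element. The function name and the tuple layout are hypothetical.

```python
def required_accesses(retrieved, join_attr_index=1):
    """For each data source, collect the join values seen in the other engines' elements."""
    accesses = {}
    for source in retrieved:
        values = {
            elem[join_attr_index]
            for other, elems in retrieved.items() if other != source
            for elem in elems
        }
        accesses[source] = sorted(values)
    return accesses


# Elements retrieved sequentially in the running example: (name, street, score).
RETRIEVED = {
    "Hotels": [("H1", "S1", 5), ("H2", "S2", 4)],
    "Restos": [("R1", "S3", 3.9), ("R2", "S1", 3.8), ("R3", "S4", 3.7),
               ("R4", "S3", 3.7), ("R5", "S2", 3.6), ("R6", "S5", 3.5)],
}
```

Each distinct value is requested once per data source, so the street S3, which appears in both R1 and R4, produces a single access to Hotels.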

These new elements, together with their element scores where available, are fetched and stored by a module 446 in a repository 448 provided in the storage management component 210. Step 345 computes new combinations and combination scores by joining the new elements and by aggregating their element scores retrieved by the different search engines and data sources via the aggregation function associated with the query; the new combinations are stored in an appropriate repository 452 in the storage management component 210.

In the running example, the attribute-based accesses return the following elements, stored in the repository 448, where different accesses are separated by a triple line:

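Hotel name   Hotel street   Hotel score
H1           S1             5
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
H2           S2             4
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
H3           S5             2
H4           S5             1

Rest. name   Rest. street   Rest. score
R2           S1             3.8
R7           S1             3.4
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
R5           S2             3.6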

The attribute-based accesses reveal that there are two new hotels (H3 and H4) in street S5, but no new hotels in the other streets S1, S2, S3 and S4. Also, there is a new restaurant (R7) in street S1, but no new restaurants in street S2. Therefore, the following three new combinations are stored in repository 452:

Hotel name   Rest. name   Hotel score   Rest. score   Street   Combined score
H3           R6           2             3.5           S5       0.6375
H4           R6           1             3.5           S5       0.5375
H1           R7           5             3.4           S1       0.925

From the set of all combinations, the K combinations with the highest combination scores are extracted in step 350, then stored in another repository 460 provided in the Storage Management Component 210, and finally presented as a query result 34 to the user by module 160.

In the running example, the top K combinations are the following ones:

Hotel name   Rest. name   Hotel score   Rest. score   Street   Combined score
H1           R2           5             3.8           S1       0.975
H1           R7           5             3.4           S1       0.925
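The final steps of the running example (joining all retrieved elements, scoring each combination and extracting the top K) can be reproduced with the sketch below. The element lists merge the contents of the two element repositories of the example; the function names and tuple layout are hypothetical, and the scoring assumes the normalized average used throughout the running example.

```python
import heapq

# All elements known after the attribute-based accesses: (name, street, score).
HOTELS = [("H1", "S1", 5), ("H2", "S2", 4), ("H3", "S5", 2), ("H4", "S5", 1)]
RESTOS = [("R1", "S3", 3.9), ("R2", "S1", 3.8), ("R3", "S4", 3.7),
          ("R4", "S3", 3.7), ("R5", "S2", 3.6), ("R6", "S5", 3.5),
          ("R7", "S1", 3.4)]


def score(hotel_stars: float, resto_rating: float) -> float:
    """Average of the two scores, normalized to [0, 1] (stars / 5, rating / 4)."""
    return (hotel_stars / 5.0 + resto_rating / 4.0) / 2.0


def top_k_combinations(hotels, restos, k):
    """Join hotels and restaurants on the street and keep the k best combinations."""
    combos = [
        (h_name, r_name, h_street, score(h_score, r_score))
        for h_name, h_street, h_score in hotels
        for r_name, r_street, r_score in restos
        if h_street == r_street
    ]
    return heapq.nlargest(k, combos, key=lambda c: c[3])
```

Calling `top_k_combinations(HOTELS, RESTOS, 2)` yields the pairs (H1, R2) and (H1, R7), in that order, matching the table above.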

1. A method for extracting, merging and ranking results produced by executing multi-domain queries over “n” domain-specific search engines, said method comprising: obtaining a query and an expected number of results K by the user through a user interface; decomposing said query into “n” subqueries and associating each subquery with exactly one domain of the “n” domain-specific search engines; selecting one search engine for each said domain and one data source for each said search engine, wherein the selection of search engines used for answering the query determines also a strategy for building a combination of “n” elements among those returned as result from the “n” domain-specific search engines and data sources, wherein every combination is produced as result of n−1 join operations between elements, such that every combination comprises exactly one element for each domain; wherein the method comprises the further steps of: performing sequential accesses to each said search engine in function of the associated subquery and receiving in response results consisting of lists of elements; performing attribute-based access to each said data sources in function of results responded by search engines and receiving in response results consisting of sets of elements; computing a plurality of combinations, each of which comprising “n” elements; associating with each combination a combination score and ranking each element combination in function of the combination score; extracting the expected number of results K of combinations with the highest combination score, so as to minimize a global cost of execution of said query, the cost being a function of the costs of requests to search engines and to data sources.
2. The method of claim 1, wherein the association of a score to each combination is performed by aggregating the element scores retrieved by the different search engines and data sources by an aggregation function, the latter being associated with the query.
3. The method of claim 2, wherein the step of performing sequential access to the selected search engines initially performs one request to every search engine, and receives as result a first sub-list of elements produced by every search engine as answer to the requests.
4. The method of claim 1, wherein the step of computing a plurality of combinations comprises the strategy for building combinations of elements by using said sub-lists of elements, thus producing the first set of combinations, and further the step of storing said combinations in the memory associated with the query.
5. The method of claim 1, wherein if the number of combinations is less than the number of results K then: a. one or more of the search engines are chosen; b. a sequential access is performed to every said chosen search engine, receiving as result new sub-lists of elements responding to the access, and said elements are stored in memory associated with the query; c. the strategy for building combinations of elements is applied to the said new sub-lists of elements and the lists of already available elements, producing new combinations, and said combinations are stored in memory associated with the query.
6. The method of claim 5, wherein after step b. those search engines that have returned no elements are marked as emptied.
7. The method of claim 5, wherein after step b. if all search engines are marked as emptied, then a message informs the user that it is not possible to form the number K of combinations, and therefore the result will include only H combinations, where H is the number of currently formed combinations and if H is equal to zero the query fails.
8. The method of claim 5, wherein either after step c. at least the number K of combinations are produced, or after step b, only H combinations are produced, and K is set equal to H.
9. The method of claim 8, wherein every distinct value V of every join attribute of every said K combinations is used for determination of required attribute-based accesses, and then all said attribute-based accesses are performed, producing as result new sets of elements, and said elements are stored in memory associated with the query.
10. The method of claim 1, wherein the strategy for building combinations of elements is applied to the said new sub-lists of elements and the lists of already available elements, producing new combinations, and said combinations are stored in memory associated with the query.
11. The method of claim 10, wherein the K combinations with the highest score present in memory are presented as the result of the query.
12. A computer program product comprising a set of executable instruction codes stored on a computer readable storage medium, wherein said set of executable instruction codes performs the steps of a method for extracting, merging and ranking results produced by executing multi-domain queries over “n” domain-specific search engines, the method comprising: obtaining a query and an expected number of results K by the user through a user interface; decomposing said query into “n” subqueries and associating each subquery with exactly one domain of the “n” domain-specific search engines; selecting one search engine for each said domain and one data source for each said search engine, wherein the selection of search engines used for answering the query determines also a strategy for building a combination of “n” elements among those returned as result from the “n” domain-specific search engines and data sources, wherein every combination is produced as result of n−1 join operations between elements, such that every combination comprises exactly one element for each domain; wherein the method comprises the further steps of: performing sequential accesses to each said search engine in function of the associated subquery and receiving in response results consisting of lists of elements; performing attribute-based access to each said data sources in function of results responded by search engines and receiving in response results consisting of sets of elements; computing a plurality of combinations, each of which comprising “n” elements; associating with each combination a combination score and ranking each element combination in function of the combination score; extracting the expected number of results K of combinations with the highest combination score, so as to minimize a global cost of execution of said query, the cost being a function of the costs of requests to search engines and to data sources.
13. A computer program product according to claim 12, wherein it comprises further: a first set of instruction codes for reading the query from an interface; a second set of instruction codes for storing the data which is used for responding to the query; a third set of instruction codes for processing the query and writing the results of the query on an interface presented to the user.
14. The computer program of claim 13, wherein the said first set of instruction codes for reading the query comprises: a set of instruction codes for reading the number K of expected results by the user from the interface; a set of instruction codes for decomposing the query into subqueries and associating each subquery with one domain; a set of instruction codes for selecting one search engine for each domain and one data source for each selected search engine.
15. The computer program of claim 13, wherein the data used for responding to the query comprise: data existing prior to the query execution, comprising: domain descriptions; search engine descriptions, said descriptions including the association of search engines to their domain of specialization; data source description, including the association of data sources to search engines; and interface description parameters, said parameters including the average cost of each request to a search engine or data source; data produced during query execution, comprising elements with properties, said properties consisting of pairs including one attribute and one value; and combinations of elements with their element scores.
16. The computer program of claim 13, wherein the said instruction codes for processing the query comprise: instructions for choosing the search engines to be invoked in the next step of said query processing method; instructions for presenting to said chosen search engines a new request and for receiving the response, consisting of an element sub-list, and for storing said sub-list of elements in the memory associated with the query; instructions for computing the new combinations and combination scores which are produced by using the said new sub-lists of elements and the already available lists of elements, and for storing said combinations and combination scores in the memory associated with the query; and instructions for testing if existing combinations are less than K, and repeating until existing combinations are at least K or all existing elements have been retrieved without producing K combinations, but producing instead H combinations, with H<K.
17. The computer program of claim 14, wherein the said instruction codes for processing the query comprises in addition: instructions for determining the attribute-based queries that should be addressed to the data sources; instructions for performing said attribute-based queries to said data sources and retrieving new elements; instructions for storing said elements in the memory associated with the query; instructions for computing the new combinations and combination scores which are produced by using the said new elements and the already available lists of elements, and for storing said combinations and combination scores in the memory associated with the query; instructions for selecting the K combinations with the highest combination score; and instructions for assembling the best combinations into the query result and present them to the user.